Assessment of colon cancer molecular mechanism: a system biology approach

Aim: The current study aimed to assess and compare colon cancer dysregulated genes from the GEO and STRING databases. Background: Colorectal cancer is the third most common cancer and the second leading cause of cancer-related mortality worldwide. There have been many studies on the molecular mechanism of colon cancer. Methods: From the STRING database, 100 differentially expressed proteins related to colon cancer were retrieved and analyzed by network analysis. The central nodes of the network were assessed by gene ontology. The findings were compared with a GSE from GEO. Results: Based on data from the STRING database, TP53, EGFR, HRAS, MYC, AKT1, GAPDH, KRAS, ERBB2, PTEN, and VEGFA were identified as central genes. The central nodes were not among the significant DEGs of the analyzed GSE. Conclusion: A combination of different database sources in system biology investigations provides useful information about the studied diseases.


Introduction
Colorectal cancer is the third most common cancer and the second leading cause of cancer-related mortality worldwide (1). It is one of the lethal cancers and is associated with problems in both diagnosis and therapy (2). Many studies have examined the molecular mechanism of colon cancer. Bioinformatics is a critical field that is applied to create new concepts from the results of genomic and proteomic studies (5)(6)(7). Dysregulated metabolites, genes, and proteins in colon cancer patients have been studied using bioinformatics. In such studies, much of the data is gathered from databanks or published articles and analyzed using bioinformatic tools (8)(9)(10). Two points about these studies are notable: first, the diversity of data sources, and second, the multiplicity of analysis methods. Depending on the selected source and method of investigation, the results can differ. It is therefore clear that the investigation protocol must be explained in order to determine the most accurate findings (8,11). GEO is a useful source of data that includes gene expression profiles of assessed samples. Many researchers select GEO as a source of data to analyze differentially expressed genes in a defined condition. GEO is not only a suitable source of data, but it is also equipped with useful software such as GEO2R, which supports the primary analysis of data. Fold change and statistical validation of data are two important outputs of GEO. The direction of gene regulation, i.e. up- or downregulation, is available from the GEO2R analysis of the studied DEGs (12,13). STRING is another useful source of data that provides the dysregulated proteins related to the studied condition. Many published articles have used the "disease query" of STRING. The combination of STRING and Cytoscape software is a powerful tool in the bioinformatic analysis of data (14,15). In the present investigation, dysregulated genes in human colon cancer were assessed using one recorded experiment in GEO together with the STRING source, and the findings from the two sources were compared.

Methods
In this study, 100 proteins associated with colon cancer were extracted from the STRING database using the "disease query" option. The proteins were imported into Cytoscape software v 3.7.2 (16) and connected by undirected edges, yielding a network of 100 nodes and 2811 links. The network consisted of a main connected component of 95 nodes and 5 isolated proteins; the main connected component was analyzed with the "NetworkAnalyzer" plug-in of Cytoscape software. The network was visualized by mapping the degree value to the color and size of the nodes.
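The separation of a network into a main connected component and isolated proteins can be illustrated with a minimal sketch. It is not the Cytoscape procedure used in the study; the function name `connected_components` and the node labels `P1` to `P5` are hypothetical, and the toy edge list merely stands in for the STRING interactions.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy data with hypothetical labels (not the proteins of the study).
toy_nodes = ["P1", "P2", "P3", "P4", "P5"]
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3")]

components = connected_components(toy_nodes, toy_edges)
main_component = components[0]
isolated = [c for c in components[1:] if len(c) == 1]
```

In this toy case the main connected component holds three nodes and two nodes remain isolated, which mirrors on a small scale the 95 connected and 5 isolated proteins reported here.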
Based on degree value, the top 10 nodes of the main connected component were selected as the hub nodes of the network. The hubs were submitted to the ClueGO v2.5.7 (17) application of Cytoscape for gene ontology analysis. The related pathways were extracted from KEGG (08.05.2020). Pathways were selected using a p-value ≤ 0.01 and medium network specificity.
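Degree-based hub selection can be sketched as follows. This is an illustrative example under stated assumptions, not the NetworkAnalyzer implementation: the function `top_hubs` is hypothetical, the edge list is toy data, and the input is assumed to contain no duplicate edges or self-loops. Ties are broken alphabetically only to make the output reproducible.

```python
from collections import Counter


def top_hubs(edges, k):
    """Rank nodes of an undirected network by degree and return the top k."""
    degree = Counter()
    for a, b in edges:
        # Each undirected edge adds one to the degree of both endpoints.
        degree[a] += 1
        degree[b] += 1
    # Sort by descending degree; break ties by node name.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy edge list with hypothetical node labels.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
hubs = top_hubs(toy_edges, 2)
print(hubs)
```

With the study's network, the same idea would be applied with `k = 10` to the edges of the main connected component.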
The GSE127069 of 6 patients, entitled "RNA sequencing for cancer tissues and adjacent tissues of third-stage rectal cancer patients with and without blood vascular thrombus" in GEO (18) was selected for analysis. The volcano plot of gene expression profiles of colon cancer tissue versus adjacent tissue was generated to check that the samples were statistically comparable. Genes with a fold change above 1.5 or below -1.5 (FC > 1.5 or FC < -1.5) and a p-value < 0.01 were selected as significant DEGs. The known genes were identified based on gene IDs from UniProt (https://www.uniprot.org).
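The DEG selection rule can be written out as a short sketch. It assumes that the fold change criterion means FC above 1.5 or below -1.5, and that the sign of FC gives the direction of regulation; the scale of FC (for example log-transformed or not) is not specified here and is left open. The function `select_degs` and the records `G1` to `G4` are hypothetical illustrations, not data from GSE127069.

```python
def select_degs(records, fc_cutoff=1.5, p_cutoff=0.01):
    """Keep genes with |FC| > fc_cutoff and p-value < p_cutoff.

    `records` is an iterable of (gene, fold_change, p_value) tuples.
    Returns a list of (gene, direction) pairs.
    """
    selected = []
    for gene, fold_change, p_value in records:
        if abs(fold_change) > fc_cutoff and p_value < p_cutoff:
            # Assumption: positive FC means upregulated, negative means downregulated.
            direction = "up" if fold_change > 0 else "down"
            selected.append((gene, direction))
    return selected


# Hypothetical records (gene, fold change, p-value); not study data.
toy_records = [
    ("G1", 2.3, 0.001),    # passes both thresholds
    ("G2", -1.8, 0.004),   # passes both thresholds
    ("G3", 1.2, 0.0005),   # fold change too small
    ("G4", -3.0, 0.2),     # p-value too large
]
degs = select_degs(toy_records)
print(degs)
```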

Results
The network, including a main connected component (shown in Figure 1) and 5 isolated proteins, was constructed from the data extracted from the STRING database. Four centrality parameters, i.e. degree (K), betweenness centrality (BC), closeness centrality (CC), and stress, were determined for the nodes of the main connected component (Table 1). TP53, EGFR, HRAS, MYC, AKT1, GAPDH, KRAS, ERBB2, PTEN, and VEGFA were identified as hub nodes. Thirty-one dysregulated terms related to the hub nodes of the colon cancer network were identified and classified into 2 groups of pathways. The pathways in the two groups and the related proteins are presented in Table 2.
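As an illustration of one of the centrality parameters, the sketch below computes closeness centrality, assuming the common definition as the reciprocal of the average shortest path length from a node to all other nodes of a connected component. Whether this matches the exact formula used by NetworkAnalyzer is an assumption. The function name and the toy path graph are hypothetical, and the input is expected to be a single connected component.

```python
from collections import defaultdict, deque


def closeness_centrality(edges):
    """Closeness centrality for every node of a connected, undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    result = {}
    for source in list(adjacency):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total = sum(distance.values())
        # (number of other reachable nodes) / (sum of distances to them)
        result[source] = (len(distance) - 1) / total if total else 0.0
    return result


# Toy path graph A - B - C - D with hypothetical labels.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
cc = closeness_centrality(toy_edges)
print(cc)
```

In the toy path graph the inner nodes score higher than the end nodes, which is the behavior expected of a measure that rewards short distances to the rest of the network.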
The volcano plot of gene expression profiles of colon cancer tissue versus adjacent tissue for the analyzed GSE is presented in Figure 2. Based on the volcano plot, the samples are comparable. A list of the significant and known genes of the GEO analysis is given in Table 3. The top 21 rows of Table 3 refer to the downregulated genes, and the other 6 genes are upregulated.

Discussion
For many diseases, the STRING database lists the related dysregulated proteins. In this research, 100 proteins that are dysregulated in human colon cancers were retrieved. The data were organized as a protein-protein interaction network (Figure 1). Analysis of the constructed network revealed that it is a scale-free network, in which a limited number of nodes, known as central nodes, can be selected as the critical nodes of the analyzed network (19). As shown in Table 1, the centrality parameters of the nodes were determined. TP53, EGFR, HRAS, MYC, AKT1, GAPDH, KRAS, ERBB2, PTEN, and VEGFA appeared as hub nodes of the assessed network. The hub genes are important central nodes that can be discriminated from the other nodes of the network as critical individuals (20). As shown in Table 1, the other centrality parameters of the hub nodes also have high values; thus, it can be concluded that the hub nodes are potent hub-bottleneck nodes. A usual and simple analysis of data was conducted to find the critical nodes of the studied network. As represented in Table 2, the related pathways for the central nodes were identified through gene ontology analysis. At this point, the analysis appears complete and a useful interpretation seems accessible. However, TP53, the top central gene related to colon cancer, is known from previous investigations as a biomarker of many cancers (21). As specificity and sensitivity are the two main properties of biomarkers (22), it can be concluded that TP53 cannot be considered a specific biomarker of colon cancer. Like TP53, the other introduced critical nodes are also related to different types of cancers. Thus, it can be concluded that the well-known data in the STRING database can be matched with various kinds of cancers.
As reported, EGFR is a key element in colorectal cancers (23), and many documents point to EGFR as a biomarker of cancers such as head and neck squamous cell carcinomas and primary non-small cell lung cancer (24,25). In another part of the study, colon cancer tissue was compared with adjacent tissues. As depicted in Figure 2, the data are suitable for analysis. In total, 27 significant DEGs that discriminate cancerous tissue from the adjacent tissue were identified. At first glance, the evidence for a correlation between these findings and the results of the STRING analysis is insufficient (compare the contents of Table 3 with the 10 central nodes introduced above). Because the GEO analysis yielded only 27 DEGs, these genes alone cannot be included in an interactome to form a scale-free network. The best way to analyze this set of genes is to add their first neighbors. STRING is a rich source of neighbors, and it offers options that allow researchers to add an adequate number of first neighbors to the queried genes. This mode of analysis enables the investigator to construct a scale-free network and analyze the queried DEGs. The centrality values that the queried genes acquire through the added first neighbors, together with their fold change values, provide a clear basis for selecting the critical DEGs among the studied genes. It can be concluded that each type of analysis is unique in its properties and findings. Depending on the researcher's preferences, a study can be designed to obtain a different result that is useful from that point of view. Many studies have used this combination mode of analysis, with different numbers of added first neighbors, to discriminate the queried DEGs (26,27). The analysis of data from the GEO and STRING sources revealed that each kind of analysis has its benefits; however, analysis using the sources separately also provided useful results.
It seems that the combination mode of analysis is a suitable and more complete method for finding a clear concept and interpretation of the studied disease.
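The comparison between the STRING hub list and a list of significant DEGs amounts to a set comparison, sketched below. The hub names are those reported in this study; the DEG names (`DEG_1` to `DEG_3`) are hypothetical placeholders, because the contents of Table 3 are not reproduced here, and the function `compare_gene_sets` is likewise an illustrative helper.

```python
def compare_gene_sets(hub_genes, deg_genes):
    """Report which gene symbols are shared between two lists and which are not."""
    # Upper-casing makes the comparison insensitive to symbol capitalization.
    hubs = {gene.upper() for gene in hub_genes}
    degs = {gene.upper() for gene in deg_genes}
    return {
        "shared": sorted(hubs & degs),
        "hubs_only": sorted(hubs - degs),
        "degs_only": sorted(degs - hubs),
    }


# Hub nodes reported in this study.
hub_genes = ["TP53", "EGFR", "HRAS", "MYC", "AKT1",
             "GAPDH", "KRAS", "ERBB2", "PTEN", "VEGFA"]
# Hypothetical placeholders standing in for the DEGs of Table 3.
placeholder_degs = ["DEG_1", "DEG_2", "DEG_3"]

comparison = compare_gene_sets(hub_genes, placeholder_degs)
print(comparison["shared"])
```

An empty `shared` list corresponds to the situation described above, in which none of the central nodes appears among the significant DEGs.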